Building Data Civilizer Pipelines with an Advanced Workflow Engine
نویسندگان
چکیده
In order for an enterprise to gain insight into its internal business and the changing outside environment, it is essential to provide the relevant data for in-depth analysis. Enterprise data is usually scattered across departments and geographic regions, and is often inconsistent. Data scientists spend the majority of their time finding, preparing, integrating, and cleaning relevant data sets. Data Civilizer is an end-toend data preparation system. In this paper, we present the complete system, focusing on our new workflow engine, a superior system for entity matching and consolidation, and new cleaning tools. Our workflow engine allows data scientists to author, execute and retrofit data preparation pipelines of different data discovery and cleaning services. Our end-to-end demo scenario is based on data from the MIT data warehouse and e-commerce data sets.
منابع مشابه
The Data Civilizer System
In many organizations, it is often challenging for users to find relevant data for specific tasks, since the data is usually scattered across the enterprise and often inconsistent. In fact, data scientists routinely report that the majority of their effort is spent finding, cleaning, integrating, and accessing data of interest to a task at hand. In order to decrease the “grunt work” needed to f...
متن کاملFastr: A Workflow Engine for Advanced Data Flows in Medical Image Analysis
With the increasing number of datasets encountered in imaging studies, the increasing complexity of processing workflows, and a growing awareness for data stewardship, there is a need for managed, automated workflows. In this paper, we introduce Fastr, an automated workflow engine with support for advanced data flows. Fastr has built-in data provenance for recording processing trails and ensuri...
متن کاملBuilding and Documenting Workflows with Python-Based Snakemake
Snakemake is a novel workflow engine with a simple Python-derived workflow definition language and an optimizing execution environment. It is the first system that supports multiple named wildcards (or variables) in input and output filenames of each rule definition. It also allows to write human-readable workflows that document themselves. We have found Snakemake especially useful for building...
متن کاملReproducible Large-Scale Neuroimaging Studies with the OpenMOLE Workflow Management System
OpenMOLE is a scientific workflow engine with a strong emphasis on workload distribution. Workflows are designed using a high level Domain Specific Language (DSL) built on top of Scala. It exposes natural parallelism constructs to easily delegate the workload resulting from a workflow to a wide range of distributed computing environments. OpenMOLE hides the complexity of designing complex exper...
متن کاملArchitectural Plan for Constructing Fault Tolerable Workflow Engines Based on Grid Service
In this paper the design and implementation of fault tolerable architecture for scientific workflow engines is presented. The engines are assumed to be implemented as composite web services. Current architectures for workflow engines do not make any considerations for substituting faulty web services with correct ones at run time. The difficulty is to rollback the execution state of the workflo...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2018